Preparation

library(tidyverse)
## ── Attaching packages ────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.3.0
## ✔ tibble  2.0.1     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(here)
## here() starts at /home/niklas/ds_mdm
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## Loading required package: survival
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart
load("/home/niklas/ds_mdm/NTR4_assignment2/data/assignment2.Rdata")

Background Questions

  1. The phenotypes in NHANES are measured during the home exam. This exam includes measurement of BMI and of blood glucose based on a sample from a blood draw.
  2. Both Framingham and the Nurses' Health Study cover a well-described population. However, the population characteristics are not representative of the general US population (for example, in terms of sex, ethnicity, and socioeconomic status). NHANES recruits participants with a top-down approach, actively preserving representativeness through stratified sampling.
  3. EGPD-Diagram. We are measuring the influence of environmental agents on phenotypes that are tightly associated with complex chronic diseases such as diabetes.
  4. Below are 5 representative codes:
     • LBXBCD: blood cadmium levels (ug/L)
     • LBXHPE: blood heptachlor epoxide, a pesticide, levels (ng/g)
     • URXP07: urine 2-phenanthrene, a residue of fossil fuel combustion, levels (ng/L)
     • LBXVID: blood Vitamin D levels, measured using a commercial kit
     • LBXPFDE: blood perfluorodecanoic acid levels, a chemical found in surfactants

Exploring Phenotypes by demographic characteristics

More than 17k participants have BMI data in the dataset.

NHData.train$BMXBMI %>% is.na() %>% table() %>% knitr::kable()
.       Freq
FALSE  17472
TRUE    3532

The BMI histogram is skewed, with a long tail towards higher values. The mean and median are shown in the plot title.

# avoid shadowing base::mean / base::median
bmi_mean <- NHData.train$BMXBMI %>% 
  mean(na.rm = TRUE) %>% 
  round(2)
bmi_median <- NHData.train$BMXBMI %>% 
  median(na.rm = TRUE) %>% 
  round(2)

NHData.train %>% 
  ggplot(aes(BMXBMI)) + 
  geom_histogram() + 
  theme_classic() + 
  labs(title = "Distribution of BMI in cohort",
       subtitle = paste0("Mean ", bmi_mean, " and Median ", bmi_median))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3532 rows containing non-finite values (stat_bin).

NHData.train %>% 
  ggplot(aes(RIDAGEYR, BMXBMI)) + 
  geom_point(alpha = 0.1) + 
  theme_classic() + 
  geom_smooth() + 
  labs(title = "BMI vs. Age")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 3532 rows containing non-finite values (stat_smooth).
## Warning: Removed 3532 rows containing missing values (geom_point).

NHData.train %>% 
  mutate(female = if_else(female == 1, TRUE, FALSE)) %>%
  ggplot(aes(female, BMXBMI)) + 
  geom_boxplot() + 
  theme_classic() + 
  labs(x = "Self identified female") + 
  labs(title = "BMI vs. Sex")
## Warning: Removed 3532 rows containing non-finite values (stat_boxplot).

NHData.train %>% 
  dplyr::select(BMXBMI, black:other_eth) %>% 
  mutate(ethnicity = case_when(
    black == 1 ~ "black",
    mexican == 1 ~ "mexican",
    other_hispanic == 1 ~ "other_hispanic",
    other_eth == 1 ~ "other_eth",
    TRUE ~ "white"
  )) %>%
  ggplot(aes(ethnicity, BMXBMI)) + 
  geom_boxplot() + 
  theme_classic() + 
  labs(x = "Ethnicity") + 
  labs(title = "BMI vs. Ethnicity") + 
  coord_flip()
## Warning: Removed 3532 rows containing non-finite values (stat_boxplot).

NHData.train %>% 
  ggplot(aes(INDFMPIR, BMXBMI)) + 
  geom_point(alpha = 0.1) + 
  theme_classic() + 
  geom_smooth() + 
  labs(title = "BMI vs. Income")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 5310 rows containing non-finite values (stat_smooth).
## Warning: Removed 5310 rows containing missing values (geom_point).

Summary of plots

The distribution of BMI values in the population is skewed to the right, with the average population BMI on the brink of overweight (BMI = 25). BMI increases rapidly during childhood, keeps increasing slowly until about age 70, and then starts decreasing.

Judging from the boxplots, sex is not a driving factor of BMI differences. On average, the white and Mexican populations have a higher BMI than the black population. BMI is lower in individuals living below the poverty line.

Proposed statistical tests

First, I would use the survey-weighted (stratified) glm functions from the survey package.

  • Non-normality can be tested with a Shapiro-Wilk test
  • A linear model of BMI vs. age can be built for three separate phases of life (childhood, adulthood, >70)
  • A linear model of BMI vs. sex can be fitted. The model gives an estimate of the effect and its significance.
  • A similar linear model with multiple dummy-coded variables can be fitted for BMI vs. ethnicity.
  • A linear model of BMI vs. INDFMPIR can test the final hypothesis.
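The first two bullet points can be sketched as follows. This is a minimal illustration on the columns named above; the age cut points (18, 70) are an assumption, and shapiro.test accepts at most 5000 values, so a random subsample is tested.

```r
bmi <- na.omit(NHData.train$BMXBMI)

# Shapiro-Wilk on a random subsample (the test is capped at 5000 values)
shapiro.test(sample(bmi, 5000))

# one linear model of BMI vs. age per life phase
phase <- cut(NHData.train$RIDAGEYR, breaks = c(0, 18, 70, Inf),
             labels = c("childhood", "adulthood", ">70"))
lapply(split(NHData.train, phase),
       function(d) coef(lm(BMXBMI ~ RIDAGEYR, data = d)))
```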

Modeling hypothesis

Defining training data

dsn <- svydesign(ids=~SDMVPSU,
                 strata=~SDMVSTRA,
                 weights=~WTMEC2YR,
                 nest=T, 
                 data=NHData.train)

BMI vs. age (one linear model)

mod <- svyglm(BMXBMI ~ RIDAGEYR, dsn)
summary(mod)
## 
## Call:
## svyglm(formula = BMXBMI ~ RIDAGEYR, design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.662323   0.144659  142.84   <2e-16 ***
## RIDAGEYR     0.145503   0.003441   42.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 45.1212)
## 
## Number of Fisher Scoring iterations: 2

There is a significant association: one year of age corresponds to a BMI increase of 0.15 kg/m².
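As a quick sanity check of this interpretation, the fitted model can be used to predict the expected BMI at two ages (a sketch using the mod object above):

```r
# expected BMI at ages 20 and 70 under the linear fit;
# the difference should be roughly 50 * 0.15 = 7.3 BMI points
predict(mod, newdata = data.frame(RIDAGEYR = c(20, 70)))
```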

The influence of sex on BMI:

mod <- svyglm(BMXBMI ~ female, dsn)
summary(mod)
## 
## Call:
## svyglm(formula = BMXBMI ~ female, design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  25.6007     0.1166  219.58  < 2e-16 ***
## female        0.4647     0.1395    3.33  0.00245 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 55.55328)
## 
## Number of Fisher Scoring iterations: 2

Being female is significantly linked to an increased BMI: it raises the expected BMI by 0.46 points.

mod <- svyglm(BMXBMI ~ black + mexican + other_hispanic + other_eth, dsn)

summary(mod)
## 
## Call:
## svyglm(formula = BMXBMI ~ black + mexican + other_hispanic + 
##     other_eth, design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    25.87563    0.15314 168.970  < 2e-16 ***
## black           0.63332    0.22672   2.793  0.00986 ** 
## mexican        -0.67686    0.20648  -3.278  0.00307 ** 
## other_hispanic -0.02531    0.32770  -0.077  0.93905    
## other_eth      -1.08261    0.52390  -2.066  0.04929 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 55.4537)
## 
## Number of Fisher Scoring iterations: 2

Black, Mexican, and other ethnicities contribute significantly to expected BMI. Different from my visual interpretation, black ethnicity is linked to an increased BMI (+0.63 points), while Mexican ethnicity is linked to a decreased expected BMI (-0.68 points).

Repeating everything for fasting glucose

The fasting glucose histogram is skewed, with a long tail towards higher values. The mean and median are shown in the plot title.

# avoid shadowing base::mean / base::median
glu_mean <- NHData.train$LBXGLU %>% 
  mean(na.rm = TRUE) %>% 
  round(2)
glu_median <- NHData.train$LBXGLU %>% 
  median(na.rm = TRUE) %>% 
  round(2)

NHData.train %>% 
  ggplot(aes(LBXGLU)) + 
  geom_histogram() + 
  theme_classic() + 
  labs(title = "Distribution of Plasma Glucose in cohort",
       subtitle = paste0("Mean ", glu_mean, " and Median ", glu_median))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 14528 rows containing non-finite values (stat_bin).

NHData.train %>% 
  ggplot(aes(RIDAGEYR, LBXGLU)) + 
  geom_point(alpha = 0.1) + 
  theme_classic() + 
  geom_smooth() + 
  labs(title = "Glucose vs. Age") + 
  scale_y_log10()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 14528 rows containing non-finite values (stat_smooth).
## Warning: Removed 14528 rows containing missing values (geom_point).

NHData.train %>% 
  mutate(female = if_else(female == 1, TRUE, FALSE)) %>%
  ggplot(aes(female, LBXGLU)) + 
  geom_boxplot() + 
  theme_classic() + 
  labs(x = "Self identified female") + 
  labs(title = "Glucose vs. Sex") + 
  scale_y_log10()
## Warning: Removed 14528 rows containing non-finite values (stat_boxplot).

NHData.train %>% 
  dplyr::select(LBXGLU, black:other_eth) %>% 
  mutate(ethnicity = case_when(
    black == 1 ~ "black",
    mexican == 1 ~ "mexican",
    other_hispanic == 1 ~ "other_hispanic",
    other_eth == 1 ~ "other_eth",
    TRUE ~ "white"
  )) %>%
  ggplot(aes(ethnicity, LBXGLU)) + 
  geom_boxplot() + 
  theme_classic() + 
  labs(x = "Ethnicity") + 
  labs(title = "Glucose vs. Ethnicity") + 
  coord_flip() + 
  scale_y_log10()
## Warning: Removed 14528 rows containing non-finite values (stat_boxplot).

NHData.train %>% 
  ggplot(aes(INDFMPIR, LBXGLU)) + 
  geom_point(alpha = 0.1) + 
  theme_classic() + 
  geom_smooth() + 
  labs(title = "Glucose vs. Income") + 
  scale_y_log10()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 15186 rows containing non-finite values (stat_smooth).
## Warning: Removed 15186 rows containing missing values (geom_point).

Summary of plots

The distribution of glucose values in the population is skewed to the right, with the average population glucose in the normal range. Plasma glucose values increase slightly around age 70.

Male sex is linked to increased glucose levels. On average, the black and Mexican populations have higher glucose levels. Average glucose readings are constant across income levels, with more high readings observed around the poverty line.

Proposed statistical tests

First, I would use the survey-weighted (stratified) glm functions from the survey package.

  • Non-normality can be tested with a Shapiro-Wilk test
  • A linear model of Glucose vs. age can be built
  • A linear model of Glucose vs. sex can be fitted. The model gives an estimate of the effect and its significance.
  • A similar linear model with multiple dummy-coded variables can be fitted for Glucose vs. ethnicity.
  • A linear model of Glucose vs. INDFMPIR can test the final hypothesis.

Modeling hypothesis

Defining training data

dsn <- svydesign(ids=~SDMVPSU,
                 strata=~SDMVSTRA,
                 weights=~WTMEC2YR,
                 nest=T, 
                 data=NHData.train)

Glucose vs. age (one linear model)

mod <- svyglm(LBXGLU ~ RIDAGEYR, dsn)
summary(mod)
## 
## Call:
## svyglm(formula = LBXGLU ~ RIDAGEYR, design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 84.06094    1.09386   76.85  < 2e-16 ***
## RIDAGEYR     0.38988    0.02728   14.29 2.17e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1053.622)
## 
## Number of Fisher Scoring iterations: 2

There is a significant association: one year of age corresponds to a glucose increase of 0.39 mg/dL.

The influence of sex on glucose:

mod <- svyglm(LBXGLU ~ female, dsn)
summary(mod)
## 
## Call:
## svyglm(formula = LBXGLU ~ female, design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 102.6982     0.6920 148.418  < 2e-16 ***
## female       -4.8326     0.7977  -6.058 1.57e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1110.426)
## 
## Number of Fisher Scoring iterations: 2

Being female is significantly linked to a decreased expected blood glucose: it lowers the expected glucose by 4.8 mg/dL relative to males.

mod <- svyglm(LBXGLU ~ black + mexican + other_hispanic + other_eth, dsn)

summary(mod)
## 
## Call:
## svyglm(formula = LBXGLU ~ black + mexican + other_hispanic + 
##     other_eth, design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    100.0432     0.6599 151.597   <2e-16 ***
## black           -1.4418     1.2950  -1.113    0.276    
## mexican          0.6421     1.0561   0.608    0.549    
## other_hispanic   4.3031     3.0142   1.428    0.166    
## other_eth        0.2110     3.3380   0.063    0.950    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1115.769)
## 
## Number of Fisher Scoring iterations: 2

There are no significant associations between ethnicity and Blood Glucose levels.

Comparing demographics of Blood Glucose levels and BMI

Both BMI and Glucose levels are influenced by age. As age increases so does the expected BMI and blood Glucose levels.

For both metrics we can observe a significant effect of participant’s sex. While female sex is linked to a higher BMI, it is also associated with lower expected blood Glucose levels.

In terms of ethnicity, we can observe an influence on expected BMI, but not for expected blood Glucose levels.

Exploring environmental variables

The distribution of serum lead is log-normal.

NHData.train %>% 
  ggplot(aes(LBXBPB)) + 
  geom_histogram() + 
  theme_classic() + 
  scale_x_log10() + 
  labs(title = "Log-Normal Serum Lead distribution",
       subtitle = "note the log10 scale on the x axis")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4089 rows containing non-finite values (stat_bin).

A possible way to transform this variable is a log10 rescaling. This makes the data more normally distributed, as it pulls the extreme values closer to the distribution’s mean.

Operating with normally distributed values makes the interpretation of a model’s coefficients easier. In addition to this, the chance that the normality assumption for the model’s error terms is kept is higher.
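A minimal sketch of the transformation's effect, using a simple moment-based skewness statistic instead of an extra package:

```r
lead <- na.omit(NHData.train$LBXBPB)

# moment-based skewness: mean of cubed z-scores
skewness <- function(x) mean(((x - mean(x)) / sd(x))^3)

skewness(lead)         # large and positive on the raw scale
skewness(log10(lead))  # much closer to zero after the transform
```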

NHData.train %>% 
  ggplot(aes(RIDAGEYR, LBXBPB)) + 
  geom_point(alpha = 0.2) +
  geom_smooth() + 
  theme_classic() + 
  scale_y_log10() + 
  labs(title = "Log-Normal Serum lead levels vs. age",
       subtitle = "note the log10 scale on the y axis")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 4089 rows containing non-finite values (stat_smooth).
## Warning: Removed 4089 rows containing missing values (geom_point).

NHData.train %>% 
  mutate(female = if_else(female == 1, TRUE, FALSE)) %>%
  ggplot(aes(female, LBXBPB)) + 
  geom_boxplot() + 
  theme_classic() + 
  labs(x = "Self identified female") + 
  labs(title = "Serum lead levels vs. Sex") + 
  scale_y_log10()
## Warning: Removed 4089 rows containing non-finite values (stat_boxplot).

NHData.train %>% 
  dplyr::select(LBXBPB, black:other_eth) %>% 
  mutate(ethnicity = case_when(
    black == 1 ~ "black",
    mexican == 1 ~ "mexican",
    other_hispanic == 1 ~ "other_hispanic",
    other_eth == 1 ~ "other_eth",
    TRUE ~ "white"
  )) %>%
  ggplot(aes(ethnicity, LBXBPB)) + 
  geom_boxplot() + 
  theme_classic() + 
  labs(x = "Ethnicity") + 
  labs(title = "Serum lead vs. Ethnicity") + 
  coord_flip() + 
  scale_y_log10()
## Warning: Removed 4089 rows containing non-finite values (stat_boxplot).

NHData.train %>% 
  ggplot(aes(INDFMPIR, LBXBPB)) + 
  geom_point(alpha = 0.1) + 
  theme_bw() + 
  geom_smooth() + 
  labs(title = "Serum lead vs. Income",
       subtitle = "we see a slight decrease of lead levels, as income increases") + 
  scale_y_log10()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 5692 rows containing non-finite values (stat_smooth).
## Warning: Removed 5692 rows containing missing values (geom_point).

Modeling Serum lead levels

Defining training data

dsn <- svydesign(ids=~SDMVPSU,
                 strata=~SDMVSTRA,
                 weights=~WTMEC2YR,
                 nest=T, 
                 data=NHData.train)

Serum lead vs. age (one linear model)

mod <- svyglm(LBXBPB ~ RIDAGEYR, dsn)
summary(mod)
## 
## Call:
## svyglm(formula = LBXBPB ~ RIDAGEYR, design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.368228   0.060867   22.48  < 2e-16 ***
## RIDAGEYR    0.016614   0.001453   11.44 4.61e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 4.095356)
## 
## Number of Fisher Scoring iterations: 2

There is a significant association: one year of age corresponds to an increase in serum lead of 0.017 ug/dL.

The influence of sex on Serum lead:

mod <- svyglm(LBXBPB ~ female, dsn)
summary(mod)
## 
## Call:
## svyglm(formula = LBXBPB ~ female, design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.37311    0.04347   54.59   <2e-16 ***
## female      -0.78060    0.03933  -19.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 4.06158)
## 
## Number of Fisher Scoring iterations: 2

Being female is significantly linked to decreased expected serum lead levels: it lowers the expected serum lead level by 0.78 relative to males.

mod <- svyglm(LBXBPB ~ black + mexican + other_hispanic + other_eth, dsn)

summary(mod)
## 
## Call:
## svyglm(formula = LBXBPB ~ black + mexican + other_hispanic + 
##     other_eth, design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.916712   0.030143  63.587  < 2e-16 ***
## black           0.362254   0.049108   7.377 9.98e-08 ***
## mexican         0.267887   0.077264   3.467  0.00192 ** 
## other_hispanic  0.006708   0.084091   0.080  0.93706    
## other_eth      -0.134662   0.058414  -2.305  0.02973 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 4.212349)
## 
## Number of Fisher Scoring iterations: 2

Mexican and black participants have higher expected values of serum lead (+0.27 and +0.36, respectively).

Repeating everything for Vitamin D

The distribution of serum Vitamin D is log-normal.

NHData.train %>% 
  ggplot(aes(LBXVID)) + 
  geom_histogram() + 
  theme_classic() + 
  scale_x_log10() + 
  labs(title = "Log-Normal Vitamin D distribution",
       subtitle = "note the log10 scale on the x axis")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 13197 rows containing non-finite values (stat_bin).

A possible way to transform this variable is a log10 rescaling. This makes the data more normally distributed, as it pulls the extreme values closer to the distribution’s mean.

Operating with normally distributed values makes the interpretation of a model’s coefficients easier. In addition to this, the chance that the normality assumption for the model’s error terms is kept is higher.

NHData.train %>% 
  ggplot(aes(RIDAGEYR, LBXVID)) + 
  geom_point(alpha = 0.2) +
  geom_smooth() + 
  theme_classic() + 
  scale_y_log10() + 
  labs(title = "Log-Normal Vitamin D levels vs. age",
       subtitle = "note the log10 scale on the y axis")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 13197 rows containing non-finite values (stat_smooth).
## Warning: Removed 13197 rows containing missing values (geom_point).

NHData.train %>% 
  mutate(female = if_else(female == 1, TRUE, FALSE)) %>%
  ggplot(aes(female, LBXVID)) + 
  geom_boxplot() + 
  theme_classic() + 
  labs(x = "Self identified female") + 
  labs(title = "Serum Vitamin D levels vs. Sex") + 
  scale_y_log10()
## Warning: Removed 13197 rows containing non-finite values (stat_boxplot).

NHData.train %>% 
  dplyr::select(LBXVID, black:other_eth) %>% 
  mutate(ethnicity = case_when(
    black == 1 ~ "black",
    mexican == 1 ~ "mexican",
    other_hispanic == 1 ~ "other_hispanic",
    other_eth == 1 ~ "other_eth",
    TRUE ~ "white"
  )) %>%
  ggplot(aes(ethnicity, LBXVID)) + 
  geom_boxplot() + 
  theme_classic() + 
  labs(x = "Ethnicity") + 
  labs(title = "Serum Vitamin D vs. Ethnicity",
       subtitle = "We can see clear differences \nin Vitamin D levels, which are directly \nlinked to the skin tone") + 
  coord_flip() + 
  scale_y_log10()
## Warning: Removed 13197 rows containing non-finite values (stat_boxplot).

NHData.train %>% 
  ggplot(aes(INDFMPIR, LBXVID)) + 
  geom_point(alpha = 0.1) + 
  theme_bw() + 
  geom_smooth() + 
  labs(title = "Serum Vitamin D vs. Income",
       subtitle = "we see a slight increase of Vitamin D levels, as income increases") + 
  scale_y_log10()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 13678 rows containing non-finite values (stat_smooth).
## Warning: Removed 13678 rows containing missing values (geom_point).

Modeling Serum Vitamin D levels

Defining training data

dsn <- svydesign(ids=~SDMVPSU,
                 strata=~SDMVSTRA,
                 weights=~WTMEC2YR,
                 nest=T, 
                 data=NHData.train)

Vitamin D vs. age (one linear model)

mod <- svyglm(LBXVID ~ RIDAGEYR, dsn)
summary(mod)
## 
## Call:
## svyglm(formula = LBXVID ~ RIDAGEYR, design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25.055610   0.464486   53.94  < 2e-16 ***
## RIDAGEYR    -0.033598   0.007164   -4.69 0.000348 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 94.46132)
## 
## Number of Fisher Scoring iterations: 2

There is a significant association: one year of age corresponds to a drop in serum Vitamin D of 0.03 ng/mL.

The influence of sex on Vitamin D:

mod <- svyglm(LBXVID ~ female, dsn)
summary(mod)
## 
## Call:
## svyglm(formula = LBXVID ~ female, design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  24.2004     0.4120   58.73   <2e-16 ***
## female       -0.8607     0.3774   -2.28   0.0388 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 94.76924)
## 
## Number of Fisher Scoring iterations: 2

Being female is significantly linked to decreased expected serum Vitamin D: it lowers the expected level by 0.86 ng/mL relative to males.

mod <- svyglm(LBXVID ~ black + mexican + other_hispanic + other_eth, dsn)

summary(mod)
## 
## Call:
## svyglm(formula = LBXVID ~ black + mexican + other_hispanic + 
##     other_eth, design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     25.8999     0.4438  58.364 4.62e-15 ***
## black          -11.3634     0.5077 -22.381 1.59e-10 ***
## mexican         -5.0671     0.8253  -6.140 7.31e-05 ***
## other_hispanic  -3.8189     1.0566  -3.614 0.004066 ** 
## other_eth       -5.1153     0.8765  -5.836 0.000113 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 79.05292)
## 
## Number of Fisher Scoring iterations: 2

All non-white ethnicities are linked to lower expected Vitamin D levels. The reduction tracks average skin tone, with black participants having the lowest average Vitamin D levels (-11 ng/mL), followed by Mexican participants (-5 ng/mL).

Demographic variables as confounders

Demographic variables could confound the associations of lead with both BMI and serum glucose.

Lead is an environmental toxin formerly emitted by gasoline combustion and now found in old paints, food, and dust.

Lead exposure has a documented gradient across socioeconomic groups, sexes, and ethnicities, with poorer parts of the population, males, and minorities having higher blood lead levels.

BMI in particular has a similar link to ethnicity. Thus, when modeling BMI as a function of lead exposure, models should be corrected for differences in ethnicity.

Similarly, when modeling blood glucose as a function of lead exposure, models should be corrected for differences in sex.

In the case of Vitamin D, similar confounders should be expected. Vitamin D levels are a proven function of skin tone and thus highly influenced by ethnicity. Moreover, there is a significant link between sex and Vitamin D levels.

When modeling BMI as a function of Vitamin D, we should control for ethnicity and other covariates.

When modeling serum Glucose as a function of Vitamin D, we should control for sex and other potential confounders.

The demographic variables discussed here are most likely confounders, not mediators, of the associations between the exposures (lead, Vitamin D) and the observed outcomes.
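As a sketch, a confounder-adjusted lead model could look like the following. This is an illustration built from the design object and columns defined above, not a model actually fitted in this assignment:

```r
# BMI vs. log lead, adjusted for sex and ethnicity
mod_adj <- svyglm(BMXBMI ~ log10(LBXBPB) + female +
                    black + mexican + other_hispanic + other_eth,
                  design = dsn)
summary(mod_adj)
```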

Executing environment-wide associations

I source the run_ewas function I defined.

source("/home/niklas/ds_mdm/NTR4_assignment2/R/ewas.R")
library(parallel)

# Drop two problematic exposures. Note that the removals are sequential:
# after dropping element 133 the indices shift, so the second line drops
# the element that was originally at position 151.
ExposureList = ExposureList[-133]
ExposureList = ExposureList[-150]

fasting_glucose_train <- ExposureList %>% 
  lapply(., run_ewas, 
           target = "LBXGLU",
           modification_target = "scale(log(x))",
           nhd = NHData.train,
           description = ExposureDescription) %>% 
  bind_rows() %>% 
  mutate(fdr = p.adjust(pvalue, method = "BH"))

fasting_glucose_test <- ExposureList %>% 
  lapply(., run_ewas, 
           target = "LBXGLU",
           modification_target = "scale(log(x))",
           nhd = NHData.test,
           description = ExposureDescription) %>% 
  bind_rows()%>% 
  mutate(fdr = p.adjust(pvalue, method = "BH"))

bmi_train <- ExposureList %>% 
  lapply(., run_ewas, 
           target = "BMXBMI",
           modification_target = "scale(x)",
           nhd = NHData.train,
           description = ExposureDescription) %>% 
  bind_rows()%>% 
  mutate(fdr = p.adjust(pvalue, method = "BH"))

bmi_test <- ExposureList[-150] %>%
  lapply(., run_ewas, 
           target = "BMXBMI",
           modification_target = "scale(x)",
           nhd = NHData.test,
           description = ExposureDescription) %>% 
  bind_rows() %>% 
  mutate(fdr = p.adjust(pvalue, method = "BH"))

I write my results

fasting_glucose_train %>%
  write_csv("/home/niklas/ds_mdm/NTR4_assignment2/data/fasting_glucose_train.csv")

fasting_glucose_test %>%
  write_csv("/home/niklas/ds_mdm/NTR4_assignment2/data/fasting_glucose_test.csv")

bmi_train %>% 
  write_csv("/home/niklas/ds_mdm/NTR4_assignment2/data/bmi_train.csv")

bmi_test %>% 
  write_csv("/home/niklas/ds_mdm/NTR4_assignment2/data/bmi_test.csv")

Analysis of EWAS results

library(patchwork)

fasting_glucose_train %>%
  filter(exposure_id != "URXUPT") %>%
  # The exposure URXUPT has an extreme estimated effect with a very large standard error.
  mutate(log_pvalue = -log10(pvalue)) %>%
  ggplot(aes(estimate, log_pvalue)) + 
  geom_point() + 
  theme_classic() + 
  labs(title = "fasting glucose") +
bmi_train %>%
  mutate(log_pvalue = -log10(pvalue)) %>%
  ggplot(aes(estimate, log_pvalue)) + 
  geom_point() + 
  theme_classic() + 
  labs(title = "bmi")

The plot shows the impact of environmental factors on our target phenotypes. Exposures without a significant, measurable effect sit in the bottom center; exposures at the extreme ends have a high -log10(p) and a large effect-size estimate.

It is useful to scale all non-binary variables so the effect-size estimate can be interpreted as a unit-less measure: the change in the outcome, in standard deviations, per standard deviation of the exposure.
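A toy sketch of what the scale(log(x)) modification does to an exposure (not the actual run_ewas internals):

```r
x <- c(0.5, 1, 2, 4, 8, 16)      # log-normally spread toy exposure

z <- as.numeric(scale(log(x)))   # log, then center and divide by the sd

mean(z)  # ~0: centered
sd(z)    # 1: unit variance, so estimates are per standard deviation
```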

One way to identify the p-value corresponding to a given FDR is to look at the p-values of measurements whose FDR is close to the threshold of interest.

fasting_glucose_train %>% 
  mutate(fdr_r = round(fdr, 2)) %>% 
  ggplot(aes(fdr, pvalue)) + 
  geom_point() + 
  scale_x_log10(limits = c(0.001, 1)) + 
  scale_y_log10(limits = c(0.00001, 1)) + 
  theme_bw() + 
  geom_smooth() + 
  geom_vline(xintercept = 0.05) + 
  geom_vline(xintercept = 0.01) + 
  labs(title = "Fasting Glucose") + 
bmi_train %>% 
  mutate(fdr_r = round(fdr, 2)) %>% 
  ggplot(aes(fdr, pvalue)) + 
  geom_point() + 
  scale_x_log10(limits = c(0.001, 1)) + 
  scale_y_log10(limits = c(0.00001, 1)) + 
  theme_bw() + 
  geom_smooth() + 
  geom_vline(xintercept = 0.05) + 
  geom_vline(xintercept = 0.01) + 
  labs(title = "BMI") 

For the fasting glucose training dataset, the p-values corresponding to an FDR of 5% and 1% are 0.0023 and 0.0003, respectively.
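The thresholds can also be computed directly instead of being read off the plot (a sketch, assuming the pvalue and fdr columns created above):

```r
# largest raw p-value that still passes each FDR threshold
fasting_glucose_train %>% 
  summarise(p_at_fdr_05 = max(pvalue[fdr < 0.05]),
            p_at_fdr_01 = max(pvalue[fdr < 0.01]))
```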

Identifying replicated findings

identify_consens <- function(df_train, df_test) {
  train_hits <- df_train %>% 
    filter(fdr < 0.1) %>% 
    mutate(direction_train = if_else(estimate < 0, 0, 1)) %>% 
    dplyr::select(exposure_id, direction_train)
  
  test_hits <- df_test %>% 
    filter(fdr < 0.1) %>% 
    mutate(direction_test = if_else(estimate < 0, 0, 1)) %>% 
    dplyr::select(exposure_id, direction_test)
  
  # keep exposures significant in both sets with a matching effect direction
  train_hits %>% 
    left_join(test_hits, by = "exposure_id") %>% 
    drop_na() %>% 
    filter(direction_train == direction_test) %>% 
    pull(exposure_id)
}

identify_consens(fasting_glucose_train,
                 fasting_glucose_test) %>% length()
## [1] 6
identify_consens(bmi_train,
                 bmi_test) %>% length()
## [1] 7

We identify 6 variables for fasting_glucose and 7 variables for bmi.

Interpreting findings

Fasting Glucose

fasting_glucose_test %>% 
  filter(exposure_id %in% identify_consens(fasting_glucose_train,
                 fasting_glucose_test)) %>% 
  arrange(fdr) %>% 
  head(3) %>% knitr::kable()
exposure_id estimate std_error pvalue exposure_name fdr
LBXGTC 0.1699684 0.0190801 0.00e+00 g-tocopherol(ug/dL) 0.0000015
LBXVID -0.1039699 0.0127786 0.00e+00 Vitamin D (ng/mL) 0.0000036
LBXCBC -0.1136925 0.0201950 1.16e-05 cis-b-carotene(ug/dL) 0.0004166

In this standardized model, vitamin D and cis-β-carotene are associated with lower fasting glucose: a one-standard-deviation increase in metabolite levels corresponds to a decrease of 0.10 and 0.11 standard deviations, respectively.

In contrast, γ-tocopherol, a vitamin E derivative, is associated with higher fasting glucose: an increase of 0.17 standard deviations per one-standard-deviation increase in metabolite levels.
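Because both outcome and exposure are z-scaled, a coefficient can be translated back into raw units. A sketch with illustrative (hypothetical) standard deviations; the real values would come from the NHANES columns:

```r
# Convert a standardized regression coefficient to raw units:
# beta_raw = beta_std * sd(outcome) / sd(exposure)
beta_std   <- -0.104   # vitamin D estimate from the table above
sd_glucose <- 25       # hypothetical SD of fasting glucose (mg/dL)
sd_vitd    <- 10       # hypothetical SD of vitamin D (ng/mL)

beta_raw <- beta_std * sd_glucose / sd_vitd
beta_raw
## [1] -0.26
# i.e. roughly -0.26 mg/dL glucose per 1 ng/mL vitamin D,
# under the assumed standard deviations
```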

BMI

bmi_test %>% 
  filter(exposure_id %in% identify_consens(bmi_train,
                 bmi_test)) %>% 
  arrange(fdr) %>% 
  head(3) %>% knitr::kable()
exposure_id estimate std_error pvalue exposure_name fdr
URXUPT -0.1605860 0.0194449 0e+00 Platinum, urine (ng/mL) 5.5e-06
LBXVID -0.0870604 0.0115114 1e-07 Vitamin D (ng/mL) 1.1e-05
LBXGTC 0.1708868 0.0230801 2e-07 g-tocopherol(ug/dL) 1.1e-05

Urine platinum and vitamin D are associated with lower BMI: a one-standard-deviation increase in metabolite levels corresponds to a decrease of 0.16 and 0.087 standard deviations, respectively.

In contrast, γ-tocopherol, a vitamin E derivative, is associated with higher BMI: an increase of 0.17 standard deviations per one-standard-deviation increase in metabolite levels.

It is interesting to see the same metabolites being associated with these two related, but distinct, phenotypes.

I now plot the correlation of the estimates for BMI and fasting glucose.

library(ggrepel)

df <- bmi_train %>% 
  filter(exposure_id != "URXUPT") %>%
  # The exposure URXUPT has an extreme estimate
  dplyr::select(exposure_id, 
                estimate_bmi = estimate) %>% 
  left_join(fasting_glucose_train %>% 
              filter(exposure_id != "URXUPT") %>%
  # The exposure URXUPT has an extreme estimate
              dplyr::select(exposure_id,
                            estimate_glucose = estimate)) 
## Joining, by = "exposure_id"
r <- cor(df$estimate_bmi, df$estimate_glucose) %>% round(2)

df %>% 
  ggplot(aes(estimate_bmi, estimate_glucose, label = exposure_id)) + 
  geom_point() +
  geom_text_repel(data = df %>% filter(abs(estimate_glucose) > .15)) + 
  geom_smooth(formula = y ~ x, method = "lm") + 
  theme_bw() + 
  labs(title = paste0("Pearson r=", r))

We can see an overall moderate to high correlation between the effects of environmental agents on BMI and on fasting glucose levels. Two agents, LBXGTC (γ-tocopherol) and LBXHPE (heptachlor epoxide, an organochlorine pesticide), are linked to both increased BMI and increased fasting glucose. In contrast, URXUCO (urine cobalt) is linked to reduced fasting glucose levels while leaving the expected BMI in the population unaffected.

Overall, this plot supports the prior observation that the effects of environmental agents on BMI and fasting glucose are highly related. Given the pathophysiological link between obesity and elevated blood glucose levels, these results are not surprising.

Cross-correlation

columns <- c(identify_consens(bmi_train, bmi_test),
             identify_consens(fasting_glucose_train,
                   fasting_glucose_test)) %>% 
  unique()

cor_mat <- NHData.test %>% 
  dplyr::select(columns) %>%
  as.matrix() %>%
  cor(use = "pairwise.complete.obs") 

cor_mat %>% 
  pheatmap::pheatmap()

The observed correlations in exposome data are markedly elevated compared to typical LD information. According to this publication, the maximum correlation between genetic loci is expected to be around r = 0.46, whereas environmental exposures can correlate with up to r = 0.66. This makes it increasingly difficult to define causal hypotheses linking one agent to a specific phenotype.

cor_mat %>% hist()
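The extreme pairwise correlation can also be extracted numerically rather than read off the heatmap. A self-contained sketch, with a small random matrix standing in for `cor_mat`:

```r
# Largest off-diagonal absolute correlation in a correlation matrix
set.seed(1)
toy_mat <- cor(matrix(rnorm(40), ncol = 4))  # stand-in for cor_mat

# upper.tri() excludes the diagonal of 1s
max_offdiag <- max(abs(toy_mat[upper.tri(toy_mat)]))
```

On the real `cor_mat`, this is where the maximum exposure correlation of about r = 0.66 quoted above would come from.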

Potential sources of bias

Selection bias: Currently, there are no exposome-wide profiling methods that, comparable to a SNP array, can profile the majority of simple and complex substances in the environment. Thus, most variables that are assessed are pre-selected for financial and technical reasons.

Confounding bias: Given the high correlation between environmental variables, it is very likely that a large number of agents stem from the same source, such as fossil fuel combustion or mining. The combination of (I) selective measurement of environmental variables because of technical constraints and (II) a strong bias towards shared sources of emission markedly increases the probability of associating confounding variables with an observed phenotype.

Reverse causality: A potential example of reverse causality is the observation of high tocopherol levels in participants with high BMI. Participants with high BMI have a larger body fat content, and tocopherol and the other lipid-soluble vitamins (A, D, E, K) are stored in fatty tissue. Instead of high tocopherol levels causing an increased BMI, elevated tocopherol levels could be a consequence of the increased fatty tissue in an obese participant's body.

Other applications

EWAS-related variables do not necessarily need to be measured from body fluid samples. Instead, behavioral data such as GPS traces can be linked to physical measurements such as air quality or soil contaminants. While the approach to data collection differs, the underlying analysis strategy would be very similar to an EWAS.

GWAS studies are mostly performed on SNP-array data. However, variants from exome-sequencing data can also be used for analysis. This can be relevant for linking somatic mutations, for example in cancer, to an outcome.

Machine learning - prediction of exposome indicators in body mass index

Defining the coefficient of determination

R-squared is defined as 1 - the ratio of the residual sum-of-squares (SSres) and the total sum-of-squares (SStot).

SStot is defined as the sum of squared differences between the mean of a dataset and the individual elements of the dataset. It is closely related to the variance.

SSres is defined as the sum of squared differences between the predicted value of an individual element and the corresponding ground truth in a dataset.

A model with a high R-squared (coefficient of determination) has a very low ratio of SSres to SStot: most of the variance in the data is captured by the model, and the discrepancy between the model’s predictions and the ground truth is comparatively small.
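This definition can be verified numerically against `lm()`; a minimal sketch on simulated data:

```r
# R-squared computed from its definition, compared to lm()'s value
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

fit    <- lm(y ~ x)
ss_res <- sum(residuals(fit)^2)   # sum of squared residuals
ss_tot <- sum((y - mean(y))^2)    # total sum of squares around the mean

r2_manual <- 1 - ss_res / ss_tot
all.equal(r2_manual, summary(fit)$r.squared)
## [1] TRUE
```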

Estimating R-squared for demographic variables

target = "scale(BMXBMI)"

covariates <- c("female",
                "black", 
                "mexican",
                "other_hispanic",
                "other_eth",
                "scale(RIDAGEYR)",
                "scale(INDFMPIR)")

input <- c(covariates)
 
 f <- as.formula(
    paste(
      paste0(target," ~"), 
      paste(input, collapse="+")
      )
     )

model_train <- lm(formula = f,
            data = NHData.train)
model_test <- lm(formula = f,
            data = NHData.test)

# The coefficient of determination is: 
r2_basis <- c(
summary(model_train)$r.squared,
summary(model_test)$r.squared)

r2_basis
## [1] 0.2343085 0.2198071
r2_basis %>% mean()
## [1] 0.2270578

Adding the strongest finding results in the following r-squared:

target = "scale(BMXBMI)"

covariates <- c("female",
                "black", 
                "mexican",
                "other_hispanic",
                "other_eth",
                "scale(RIDAGEYR)",
                "scale(URXUPT)", 
            #   "scale(LBXVID)", 
            #    "scale(LBXGTC)",
                "scale(INDFMPIR)")

input <- c(covariates)
 
 f <- as.formula(
    paste(
      paste0(target," ~"), 
      paste(input, collapse="+")
      )
     )

model_train <- lm(formula = f,
            data = NHData.train)
model_test <- lm(formula = f,
            data = NHData.test)

# The coefficient of determination is: 
r2_top1 <- c(
summary(model_train)$r.squared,
summary(model_test)$r.squared)

r2_top1
## [1] 0.1619781 0.1309004
r2_top1 %>% mean()
## [1] 0.1464392

The R-squared of the model that includes the top variable dropped! Adding this hit actually decreases the mean explained variance by about 0.08 (from 0.227 to 0.146).
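Part of this drop is a sample effect: `lm()` silently removes rows in which a predictor is missing, so the baseline and the extended model are not fit on the same participants. A self-contained toy illustration of why the two R-squared values are not directly comparable (simulated data standing in for `NHData.train`):

```r
# Toy illustration: lm() silently drops rows with missing predictors,
# so R-squared values from models with different predictors can refer
# to different subsets of participants.
set.seed(7)
n  <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- df$x1 + rnorm(n)
df$x2[sample(n, 400)] <- NA              # x2 observed for only 100 rows

r2_full  <- summary(lm(y ~ x1, data = df))$r.squared        # 500 rows
r2_added <- summary(lm(y ~ x1 + x2, data = df))$r.squared   # 100 rows only
r2_fair  <- summary(lm(y ~ x1, data = df[!is.na(df$x2), ]))$r.squared

# r2_added should be compared against r2_fair, not r2_full
```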

target = "scale(BMXBMI)"

covariates <- c("female",
                "black", 
                "mexican",
                "other_hispanic",
                "other_eth",
                "scale(RIDAGEYR)",
                "scale(URXUPT)", 
              "scale(LBXVID)", 
                "scale(LBXGTC)",
                "scale(INDFMPIR)")

input <- c(covariates)
 
 f <- as.formula(
    paste(
      paste0(target," ~"), 
      paste(input, collapse="+")
      )
     )

model_train <- lm(formula = f,
            data = NHData.train)
model_test <- lm(formula = f,
            data = NHData.test)

# The coefficient of determination is: 
r2_top3 <- c(
summary(model_train)$r.squared,
summary(model_test)$r.squared)

r2_top3
## [1] 0.2325259 0.2058936
r2_top3 %>% mean()
## [1] 0.2192098

The model that includes our top 3 variables fits the data better than the 1-variable model, but performs comparably to the baseline: the difference in mean R-squared is about -0.008.

Interpreting model performance

The model’s performance does not markedly increase by adding the 3 covariates.

On a new dataset I would expect there to be no benefit to the explained variance, as the variables have been picked based on the present dataset and there is a considerable risk of overfitting.
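Out-of-sample explained variance can be checked directly by evaluating held-out predictions rather than refitting on the test data. A minimal sketch on simulated train/test splits (the real analysis would use `NHData.train` and `NHData.test`):

```r
# Out-of-sample R-squared: 1 - SSres/SStot evaluated on held-out data
set.seed(3)
train <- data.frame(x = rnorm(200)); train$y <- 0.5 * train$x + rnorm(200)
test  <- data.frame(x = rnorm(200)); test$y  <- 0.5 * test$x  + rnorm(200)

fit  <- lm(y ~ x, data = train)
yhat <- predict(fit, newdata = test)

r2_oos <- 1 - sum((test$y - yhat)^2) / sum((test$y - mean(test$y))^2)
```

Unlike in-sample R-squared, this quantity can be negative if the model generalizes poorly, which is the overfitting risk described above.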

Bonus

Defining a cross-sectional study

A cross-sectional study measures the prevalence of a set of phenotypes, conditions and attributes at a single point in time in a specific population. In contrast to longitudinal study designs, which follow up on participants over time, its ability to generate causal hypotheses is limited.

Considerations: time to death

We can implement three different modeling approaches:

1. We can fit a logistic regression to identify individuals that die before a given age/timepoint.
2. We can fit a linear regression to predict the time of survival after observing the patient.
3. We can fit a Cox proportional hazards model to predict survival, assuming a proportional hazard for different subgroups in the cohort.

Modeling: time to death

covariates <- c("female",
                "black", 
                "mexican",
                "other_hispanic",
                "other_eth",
                "scale(RIDAGEYR)",
            #   "scale(URXUPT)", 
            #   "scale(LBXVID)", 
            #    "scale(LBXGTC)",
                "scale(INDFMPIR)")

input <- c(covariates)
 
 f <- as.formula(
    paste(
      paste0("PERMTH_EXM ~"), 
      paste(input, collapse="+")
      )
     )

dsn <- svydesign(ids=~SDMVPSU,
                   strata=~SDMVSTRA,
                   weights=~WTMEC2YR,
                   nest=T, 
                   data=NHData.train %>%
                   dplyr::filter(MORTSTAT == 1))
 
model_lm <- svyglm(f, 
                    design = dsn)

summary(model_lm)
## 
## Call:
## svyglm(formula = f, design = dsn)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train %>% dplyr::filter(MORTSTAT == 
##         1))
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      41.4170     1.6669  24.847   <2e-16 ***
## female            0.9836     1.8982   0.518   0.6095    
## black            -5.3715     2.4341  -2.207   0.0381 *  
## mexican          -1.4913     3.6071  -0.413   0.6833    
## other_hispanic    3.9892     5.5588   0.718   0.4805    
## other_eth       -10.0868     9.0785  -1.111   0.2785    
## scale(RIDAGEYR)  -0.3299     1.1754  -0.281   0.7816    
## scale(INDFMPIR)  -0.9955     0.8714  -1.142   0.2656    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 591.7718)
## 
## Number of Fisher Scoring iterations: 2

Modeling the time to death with a survey-weighted linear model restricted to deceased participants returns only Black ethnicity as a significant coefficient.

As an alternative, I fit a logistic regression model on the whole training dataset and try to predict the death of participants at the end of the follow-up period (below). This model has multiple variables with significant contribution, including age, sex and income.

covariates <- c("female",
                "black", 
                "mexican",
                "other_hispanic",
                "other_eth",
                "scale(RIDAGEYR)",
            #    "scale(URXUPT)", 
            #   "scale(LBXVID)", 
            #    "scale(LBXGTC)",
                "scale(INDFMPIR)")

input <- c(covariates)
 
 f <- as.formula(
    paste(
      paste0("MORTSTAT ~"), 
      paste(input, collapse="+")
      )
     )
 
 dsn <- svydesign(ids=~SDMVPSU,
                   strata=~SDMVSTRA,
                   weights=~WTMEC2YR,
                   nest=T, 
                   data=NHData.train)
 
model_log <- svyglm(f, 
                    design = dsn,
                    family=quasibinomial)
## Warning in summary.glm(g): observations with zero weight not used for
## calculating dispersion
## Warning in summary.glm(glm.object): observations with zero weight not used
## for calculating dispersion
summary(model_log)
## 
## Call:
## svyglm(formula = f, design = dsn, family = quasibinomial)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -4.77028    0.19154 -24.905  < 2e-16 ***
## female          -0.57075    0.13131  -4.346 0.000259 ***
## black            0.33367    0.14660   2.276 0.032931 *  
## mexican         -0.18309    0.23022  -0.795 0.434960    
## other_hispanic  -0.14827    0.28689  -0.517 0.610457    
## other_eth       -0.65125    0.33934  -1.919 0.068030 .  
## scale(RIDAGEYR)  2.15831    0.10635  20.294 9.82e-16 ***
## scale(INDFMPIR) -0.39616    0.06518  -6.078 4.07e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 1.398284)
## 
## Number of Fisher Scoring iterations: 7

Now I include my hit covariates in the model.

covariates <- c("female",
                "black", 
                "mexican",
                "other_hispanic",
                "other_eth",
                "scale(RIDAGEYR)",
                "scale(URXUPT)", 
              "scale(LBXVID)", 
               "scale(LBXGTC)",
                "scale(INDFMPIR)")

input <- c(covariates)
 
 f <- as.formula(
    paste(
      paste0("MORTSTAT ~"), 
      paste(input, collapse="+")
      )
     )
 
 dsn <- svydesign(ids=~SDMVPSU,
                   strata=~SDMVSTRA,
                   weights=~WTMEC2YR,
                   nest=T, 
                   data=NHData.train)
 
model_log <- svyglm(f, 
                    design = dsn,
                    family=quasibinomial)

summary(model_log)
## 
## Call:
## svyglm(formula = f, design = dsn, family = quasibinomial)
## 
## Survey design:
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = NHData.train)
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -5.8560     1.3797  -4.244 0.008136 ** 
## female           -0.8755     0.2966  -2.951 0.031835 *  
## black             0.3327     0.4188   0.794 0.462996    
## mexican          -0.3337     0.6755  -0.494 0.642290    
## other_hispanic   -0.2362     0.3887  -0.608 0.569913    
## other_eth       -14.5529     0.4148 -35.083 3.54e-07 ***
## scale(RIDAGEYR)   2.5027     0.3261   7.675 0.000598 ***
## scale(URXUPT)   -26.3074    78.5743  -0.335 0.751358    
## scale(LBXVID)    -0.1123     0.1715  -0.655 0.541423    
## scale(LBXGTC)     0.2568     0.1433   1.792 0.133079    
## scale(INDFMPIR)  -0.5365     0.1669  -3.215 0.023608 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 4.013365)
## 
## Number of Fisher Scoring iterations: 17

Our candidate variables do not add a significant coefficient to the model.

Finally, I fit a Cox proportional hazards model to the data.

library(survminer)
## Loading required package: ggpubr
## Loading required package: magrittr
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
covariates <- c("female",
                "black", 
                "mexican",
                "other_hispanic",
                "other_eth",
                "scale(RIDAGEYR)",
              #   "scale(URXUPT)", 
              # "scale(LBXVID)", 
              #  "scale(LBXGTC)",
                "scale(INDFMPIR)")

input <- c(covariates)

 f <- as.formula(
    paste(
      paste0("Surv(PERMTH_EXM, MORTSTAT)  ~"), 
      paste(input, collapse="+")
      )
     )
 
model_cox <- coxph(f, data = NHData.train)

summary(model_cox)
## Call:
## coxph(formula = f, data = NHData.train)
## 
##   n= 9445, number of events= 729 
##    (11559 observations deleted due to missingness)
## 
##                     coef exp(coef) se(coef)      z Pr(>|z|)    
## female          -0.51210   0.59924  0.07582 -6.754 1.44e-11 ***
## black            0.23687   1.26728  0.10029  2.362   0.0182 *  
## mexican         -0.24172   0.78528  0.11059 -2.186   0.0288 *  
## other_hispanic  -0.22992   0.79460  0.19579 -1.174   0.2403    
## other_eth       -0.52239   0.59310  0.32137 -1.626   0.1041    
## scale(RIDAGEYR)  2.03086   7.62065  0.07271 27.933  < 2e-16 ***
## scale(INDFMPIR) -0.36441   0.69461  0.04594 -7.932 2.16e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                 exp(coef) exp(-coef) lower .95 upper .95
## female             0.5992     1.6688    0.5165    0.6952
## black              1.2673     0.7891    1.0411    1.5425
## mexican            0.7853     1.2734    0.6323    0.9753
## other_hispanic     0.7946     1.2585    0.5414    1.1663
## other_eth          0.5931     1.6861    0.3159    1.1135
## scale(RIDAGEYR)    7.6207     0.1312    6.6085    8.7878
## scale(INDFMPIR)    0.6946     1.4397    0.6348    0.7601
## 
## Concordance= 0.866  (se = 0.006 )
## Rsquare= 0.15   (max possible= 0.747 )
## Likelihood ratio test= 1536  on 7 df,   p=<2e-16
## Wald test            = 996.6  on 7 df,   p=<2e-16
## Score (logrank) test = 1541  on 7 df,   p=<2e-16

This (above) is the Cox model without environmental covariates. Most demographic variables influence the hazard of death.

covariates <- c("female",
                "black", 
                "mexican",
                "other_hispanic",
                "other_eth",
                "scale(RIDAGEYR)",
                "scale(URXUPT)",
              #"scale(LBXVID)",
              # "scale(LBXGTC)",
                "scale(INDFMPIR)")

input <- c(covariates)

 f <- as.formula(
    paste(
      paste0("Surv(PERMTH_EXM, MORTSTAT)  ~"), 
      paste(input, collapse="+")
      )
     )
 
model_cox <- coxph(f, data = NHData.train)

summary(model_cox)
## Call:
## coxph(formula = f, data = NHData.train)
## 
##   n= 2960, number of events= 221 
##    (18044 observations deleted due to missingness)
## 
##                    coef exp(coef) se(coef)      z Pr(>|z|)    
## female          -0.5282    0.5897   0.1394 -3.788 0.000152 ***
## black            0.3134    1.3681   0.1762  1.779 0.075306 .  
## mexican         -0.2434    0.7840   0.1942 -1.253 0.210147    
## other_hispanic  -0.6007    0.5484   0.3936 -1.526 0.126927    
## other_eth        0.1287    1.1373   0.5155  0.250 0.802885    
## scale(RIDAGEYR)  2.0993    8.1604   0.1353 15.518  < 2e-16 ***
## scale(URXUPT)    0.6359    1.8888   0.2286  2.781 0.005414 ** 
## scale(INDFMPIR) -0.4830    0.6169   0.0842 -5.736 9.67e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                 exp(coef) exp(-coef) lower .95 upper .95
## female             0.5897     1.6958    0.4487    0.7750
## black              1.3681     0.7309    0.9685    1.9325
## mexican            0.7840     1.2755    0.5358    1.1471
## other_hispanic     0.5484     1.8234    0.2536    1.1861
## other_eth          1.1373     0.8793    0.4141    3.1235
## scale(RIDAGEYR)    8.1604     0.1225    6.2598   10.6380
## scale(URXUPT)      1.8888     0.5294    1.2066    2.9567
## scale(INDFMPIR)    0.6169     1.6209    0.5231    0.7276
## 
## Concordance= 0.874  (se = 0.011 )
## Rsquare= 0.152   (max possible= 0.686 )
## Likelihood ratio test= 487.8  on 8 df,   p=<2e-16
## Wald test            = 315.7  on 8 df,   p=<2e-16
## Score (logrank) test = 507.5  on 8 df,   p=<2e-16

None of the top 3 BMI-related exposures increases the explained variance when all are added at the same time. After restricting to a single BMI-linked feature, urine platinum levels, this variable has a significant coefficient.

The hazard ratio of scaled urine platinum is 1.89: an increase in urine platinum levels by one standard deviation increases the hazard of death at any given timepoint by roughly 89%.
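The hazard ratio and its confidence interval follow directly from the coefficient and its standard error; a sketch using the platinum estimate from the output above:

```r
# Hazard ratio with approximate 95% CI from a Cox coefficient and its SE
coef_pt <- 0.6359   # scale(URXUPT) coefficient from the model above
se_pt   <- 0.2286   # its standard error

hr    <- exp(coef_pt)                 # ~1.89
ci_lo <- exp(coef_pt - 1.96 * se_pt)  # ~1.21
ci_hi <- exp(coef_pt + 1.96 * se_pt)  # ~2.96
c(hr, ci_lo, ci_hi)
```

These match the `exp(coef)` column and the 95% interval reported by `coxph` above, up to rounding.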

Below, I show an example of a Cox model with no significant coefficient for the shared variable LBXGTC.

covariates <- c("female",
                "black", 
                "mexican",
                "other_hispanic",
                "other_eth",
                "scale(RIDAGEYR)",
               # "scale(URXUPT)",
              #"scale(LBXVID)",
               "scale(LBXGTC)",
                "scale(INDFMPIR)")

input <- c(covariates)

 f <- as.formula(
    paste(
      paste0("Surv(PERMTH_EXM, MORTSTAT)  ~"), 
      paste(input, collapse="+")
      )
     )
 
model_cox <- coxph(f, data = NHData.train)

summary(model_cox)
## Call:
## coxph(formula = f, data = NHData.train)
## 
##   n= 8340, number of events= 587 
##    (12664 observations deleted due to missingness)
## 
##                     coef exp(coef) se(coef)      z Pr(>|z|)    
## female          -0.55842   0.57211  0.08490 -6.578 4.78e-11 ***
## black            0.21240   1.23664  0.11177  1.900   0.0574 .  
## mexican         -0.29675   0.74323  0.12971 -2.288   0.0221 *  
## other_hispanic  -0.31050   0.73308  0.22008 -1.411   0.1583    
## other_eth       -0.44525   0.64066  0.35945 -1.239   0.2155    
## scale(RIDAGEYR)  2.07165   7.93788  0.08206 25.246  < 2e-16 ***
## scale(LBXGTC)    0.05565   1.05723  0.02956  1.883   0.0597 .  
## scale(INDFMPIR) -0.38940   0.67746  0.05157 -7.551 4.32e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                 exp(coef) exp(-coef) lower .95 upper .95
## female             0.5721     1.7479    0.4844    0.6757
## black              1.2366     0.8086    0.9934    1.5395
## mexican            0.7432     1.3455    0.5764    0.9584
## other_hispanic     0.7331     1.3641    0.4762    1.1285
## other_eth          0.6407     1.5609    0.3167    1.2960
## scale(RIDAGEYR)    7.9379     0.1260    6.7586    9.3229
## scale(LBXGTC)      1.0572     0.9459    0.9977    1.1203
## scale(INDFMPIR)    0.6775     1.4761    0.6123    0.7495
## 
## Concordance= 0.872  (se = 0.007 )
## Rsquare= 0.143   (max possible= 0.709 )
## Likelihood ratio test= 1287  on 8 df,   p=<2e-16
## Wald test            = 826.2  on 8 df,   p=<2e-16
## Score (logrank) test = 1293  on 8 df,   p=<2e-16